TD - Chapter details v2

Chapter details

The English language processing chain (LPC) consists of the following components, executed in sequence:

Paragraph splitter – based on the regular expression "((^.*\S+.*$)+)". More information can be found in the code of the com.tetracom.uima.text.ParagraphSplitter class.
URL and e-mail annotator – based on regular expressions. URLs and e-mail addresses contain "." (dot) characters, which confuse the subsequent components (for example the sentence splitter, which treats "." as a possible sentence end). Therefore, URLs and e-mail addresses found in the text are annotated as named entities and skipped by the other annotators in the chain.
Sentence splitter – the English sentence splitter uses the OpenNLP SentenceDetectorME from the opennlp.tools.sentdetect package to split raw text into sentences. A maximum entropy model evaluates each occurrence of the characters ".", "!", and "?" in a string and determines whether it marks the end of a sentence. More information can be found at: http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetectorME.html.
Tokenizer – the tokenizer uses the OpenNLP TokenizerME from the opennlp.tools.tokenize package. The current implementation of the tokenizer is not thread-safe, so a separate tokenizer instance must be created for each thread. However, a single TokenizerModel instance can be shared by all tokenizer instances in order to save memory. More information can be found at: http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/TokenizerME.html.
POS tagger – uses the POSTaggerME from the opennlp.tools.postag package. All punctuation characters are tagged "PU". More information can be found at: http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTaggerME.html.
Lemmatizer – uses the morphological analysis tool from the RASP (Robust Accurate Statistical Parsing) system, second distribution (RASPv2). The development of the RASP system was funded by the UK EPSRC within the project "Robust Accurate Statistical Parsing (RASP)" (grants GR/N36462 and GR/N36493). Since the end of that project, the system has continued to be extended and enhanced on an ongoing basis. The tagset this tool uses is close to CLAWS C7, although it is in fact a cut-down version of the CLAWS C2 tagset. The tags produced by the OpenNLP POS tagger therefore have to be converted to the CLAWS C2 tagset before the RASP lemmatizer can be used in the LPC. A new version of the RASP system, version 3, became available at the time of writing of this document. It will be adopted in the English LPC by the end of the project.
Noun phrase extractor – the grammar and structure of the English noun phrase are described in a set of 14 rules, following the format of a ParseEst sub-component.
Named entity recognizer – named entities (NEs) are extracted using the OpenNLP NameFinderME from the opennlp.tools.namefind package. The tool recognizes seven types of named entities: date, time, location, money, organization, percentage and person. Tetracom added two further named entity types, recognized using regular expressions: e-mail addresses and URLs.
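
The two regular-expression steps at the start of the chain can be illustrated with a minimal sketch. It is not the code of com.tetracom.uima.text.ParagraphSplitter or of the URL and e-mail annotator. The paragraph pattern is the one quoted above, compiled with Pattern.MULTILINE, which is an assumption; under that assumption each match is one line containing at least one non-whitespace character. The URL and EMAIL patterns, the Span class and all method names are hypothetical and deliberately simplified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexStepsSketch {

    /** Hypothetical stand-in for an annotation: a type plus character offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Pattern quoted in the text. MULTILINE is an assumption: it lets '^' and
    // '$' match at line boundaries instead of only at the ends of the input.
    static final Pattern PARAGRAPH =
            Pattern.compile("((^.*\\S+.*$)+)", Pattern.MULTILINE);

    // Simplified, hypothetical patterns; the real annotator's patterns are not
    // given in this document. The URL pattern will, for instance, swallow
    // punctuation that directly follows a URL.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    /** Returns one span of the given type for every match of the pattern. */
    static List<Span> annotate(String type, Pattern pattern, String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            spans.add(new Span(type, matcher.start(), matcher.end()));
        }
        return spans;
    }

    /** Returns the text covered by a span. */
    static String cover(Span span, String text) {
        return text.substring(span.begin, span.end);
    }

    public static void main(String[] args) {
        String text = "See http://opennlp.sourceforge.net/ for details.\n"
                + "\n"
                + "Contact info@example.org today.";

        List<Span> spans = new ArrayList<>();
        spans.addAll(annotate("Paragraph", PARAGRAPH, text));
        spans.addAll(annotate("URL", URL, text));
        spans.addAll(annotate("Email", EMAIL, text));

        for (Span span : spans) {
            System.out.println(span.type + " [" + span.begin + ", " + span.end
                    + "): " + cover(span, text));
        }
    }
}
```

Recording offsets rather than rewriting the text leaves the input unchanged, so later components can check whether a "." falls inside a URL or e-mail span and skip it.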

The sentence splitter, tokenizer, POS tagger and primary named entity recognizer for English are based on the OpenNLP project (http://opennlp.sourceforge.net/). OpenNLP hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, POS tagging, chunking and parsing, named entity detection, and coreference resolution. All OpenNLP tools work with the Penn Treebank tagset (http://bulba.sdsu.edu/jeanette/thesis/PennTags.html).
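
Because the OpenNLP tools emit Penn Treebank tags while the RASP lemmatizer expects CLAWS-style tags, a conversion step sits between the POS tagger and the lemmatizer. The following is a minimal sketch of such a step, not the conversion used in the LPC. The handful of mappings is illustrative only, the "word_TAG" input format for the lemmatizer is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class TagsetConversionSketch {

    // Illustrative subset only; the real conversion table is not given in this
    // document. CLAWS tags forms of "be", "have" and "do" separately from
    // lexical verbs, which this sketch ignores.
    private static final Map<String, String> PENN_TO_CLAWS = new HashMap<>();

    static {
        PENN_TO_CLAWS.put("NN", "NN1");   // singular common noun
        PENN_TO_CLAWS.put("NNS", "NN2");  // plural common noun
        PENN_TO_CLAWS.put("VB", "VV0");   // verb, base form
        PENN_TO_CLAWS.put("VBD", "VVD");  // verb, past tense
        PENN_TO_CLAWS.put("VBG", "VVG");  // verb, -ing form
        PENN_TO_CLAWS.put("VBN", "VVN");  // verb, past participle
        PENN_TO_CLAWS.put("VBZ", "VVZ");  // verb, third person singular present
    }

    /** Maps a Penn Treebank tag to a CLAWS-style tag; unmapped tags pass through. */
    static String toClaws(String pennTag) {
        return PENN_TO_CLAWS.getOrDefault(pennTag, pennTag);
    }

    /** Builds lemmatizer input in the assumed "word_TAG" format. */
    static String toLemmatizerInput(String token, String pennTag) {
        return token + "_" + toClaws(pennTag);
    }

    public static void main(String[] args) {
        String[] tokens = {"dogs", "barked"};
        String[] pennTags = {"NNS", "VBD"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(toLemmatizerInput(tokens[i], pennTags[i]));
        }
    }
}
```

Passing unmapped tags through unchanged keeps the sketch short; a complete converter would need an explicit entry, or an explicit failure, for every tag the POS tagger can produce.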
